Proposed online reenactment setup: a monocular target video sequence (e.g., from YouTube) is reenacted based on the
expressions of a source actor who is recorded live with a commodity webcam.


Abstract 

We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video).
The source sequence is also a monocular video stream, captured live with a commodity webcam.
Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion.
To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling.
At run time, we track facial expressions of both source and target video using a dense photometric consistency measure.
Reenactment is then achieved by fast and efficient deformation transfer between source and target.
The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit.
Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination.
We demonstrate our method in a live setup, where YouTube videos are reenacted in real time.


1. Introduction 

In recent years, real-time markerless facial performance capture based on commodity sensors has been demonstrated.
Impressive results have been achieved, both based on RGB [8, 6] as well as RGB-D data [31, 10, 21, 4, 16].
These techniques have become increasingly popular for the animation of virtual CG avatars in video games and movies.
It is now feasible to run these face capture and tracking algorithms from home, which is the foundation for many VR and AR applications, such as teleconferencing.

In this paper, we employ a new dense markerless facial performance capture method based on monocular RGB data, similar to state-of-the-art methods.
However, instead of transferring facial expressions to virtual CG characters, our main contribution is monocular facial reenactment in real time.
In contrast to previous reenactment approaches that run offline [?], our goal is the online transfer of facial expressions of a source actor captured by an RGB sensor to a target actor.
The target sequence can be any monocular video, e.g., legacy video footage downloaded from YouTube with a facial performance.
We aim to modify the target video in a photo-realistic fashion, such that it is virtually impossible to notice the manipulations.
Faithful photo-realistic facial reenactment is the foundation for a variety of applications; for instance, in video conferencing, the video feed can be adapted to match the face motion of a translator, or face videos can be convincingly dubbed to a foreign language.

In our method, we first reconstruct the shape identity of the target actor using a new global non-rigid model-based bundling approach based on a prerecorded training sequence.
As this preprocess is performed globally on a set of training frames, we can resolve geometric ambiguities common to monocular reconstruction.
At runtime, we track the expressions of both the source and the target actor's video by a dense analysis-by-synthesis approach based on a statistical facial prior.
We demonstrate that our RGB tracking accuracy is on par with the state of the art, even with online tracking methods relying on depth data.
In order to transfer expressions from the source to the target actor in real time, we propose a novel transfer function that efficiently applies deformation transfer [27] directly in the used low-dimensional expression space.
For final image synthesis, we re-render the target's face with transferred expression coefficients and composite it with the target video's background under consideration of the estimated environment lighting.
Finally, we introduce a new image-based mouth synthesis approach that generates a realistic mouth interior by retrieving and warping best matching mouth shapes from the offline sample sequence.
It is important to note that we maintain the appearance of the target mouth shape; in contrast, existing methods either copy the source mouth region onto the target [30, 11] or render a generic teeth proxy [14, 29], both of which lead to inconsistent results.
Fig. 1 shows an overview of our method.

We demonstrate highly convincing transfer of facial expressions from a source to a target video in real time.
We show results with a live setup where a source video stream, which is captured by a webcam, is used to manipulate a target YouTube video.
In addition, we compare against state-of-the-art reenactment methods, which we outperform both in terms of resulting video quality and runtime (we are the first real-time RGB reenactment method).
In summary, our key contributions are:

• dense, global non-rigid model-based bundling, 

• accurate tracking, appearance, and lighting estimation 
in unconstrained live RGB video, 

• person-dependent expression transfer using subspace 
deformations, 

• and a novel mouth synthesis approach. 

2. Related Work 

Offline RGB Performance Capture. Recent offline performance capture techniques approach the hard monocular reconstruction problem by fitting a blendshape [15] or a multi-linear face [26] model to the input video sequence.
Even geometric fine-scale surface detail is extracted via inverse shading-based surface refinement.
Ichim et al. [?] build a personalized face rig from just monocular input.
They perform a structure-from-motion reconstruction of the static head from a specifically captured video, to which they fit an identity and expression model.
Person-specific expressions are learned from a training sequence.
Suwajanakorn et al. [28] learn an identity model from a collection of images and track the facial animation based on a model-to-image flow field.
Shi et al. [26] achieve impressive results based on global energy optimization of a set of selected keyframes.
Our model-based bundling formulation to recover actor identities is similar to their approach; however, we use robust and dense global photometric alignment, which we enforce with an efficient data-parallel optimization strategy on the GPU.

Online RGB-D Performance Capture. Weise et al. [32] capture facial performances in real time by fitting a parametric blendshape model to RGB-D data, but they require a professional, custom capture setup.
The first real-time facial performance capture system based on a commodity depth sensor has been demonstrated by Weise et al. [?].
Follow-up work [?] focused on corrective shapes [4], dynamically adapting the blendshape basis [21], non-rigid mesh deformation [10], and robustness against occlusions [?].
These works achieve impressive results, but rely on depth data, which is unavailable in most video footage.

Online RGB Performance Capture. While many sparse real-time face trackers exist, e.g., [25], real-time dense monocular tracking is the basis of realistic online facial reenactment.
Cao et al. [?] propose a real-time regression-based approach to infer 3D positions of facial landmarks which constrain a user-specific blendshape model.
Follow-up work [?] also regresses fine-scale face wrinkles.
These methods achieve impressive results, but are not directly applicable as a component in facial reenactment, since they do not facilitate dense, pixel-accurate tracking.

Offline Reenactment. Vlasic et al. [30] perform facial reenactment by tracking a face template, which is re-rendered under different expression parameters on top of the target; the mouth interior is directly copied from the source video.
Dale et al. [11] achieve impressive results using a parametric model, but they target face replacement and compose the source face over the target.
Image-based offline mouth re-animation was shown in [5].
Garrido et al. [13] propose an automatic purely image-based approach to replace the entire face.
These approaches merely enable self-reenactment, i.e., when source and target are the same person; in contrast, we perform reenactment of a different target actor.
Recent work presents virtual dubbing [14], a problem similar to ours; however, the method runs at slow offline rates and relies on a generic teeth proxy for the mouth interior.
Kemelmacher et al. [20] generate face animations from large image collections, but the obtained results lack temporal coherence.
Li et al. [22] retrieve frames from a database based on a similarity metric.
They use optical flow as appearance and velocity measure and search for the k-nearest neighbors based on time stamps and flow distance.
Saragih et al. [25] present a real-time avatar animation system from a single image.
Their approach is based on sparse landmark tracking, and the mouth of the source is copied to the target using texture warping.
Berthouzoz et al. [?] find a flexible number of in-between frames for a video sequence using shortest path search on a graph that encodes frame similarity.
Kawai et al. [?] re-synthesize the inner mouth for a given frontal 2D animation using a tooth and tongue image database; they are limited to frontal poses, and do not produce as realistic renderings as ours under general head motion.

Figure 1: Method overview.

Online Reenactment. Recently, the first online facial reenactment approaches based on RGB-(D) data have been proposed.
Kemelmacher-Shlizerman et al. [?] enable image-based puppetry by querying similar images from a database.
They employ an appearance cost metric and consider rotation angular distance, which is similar to Kemelmacher et al. [20].
While they achieve impressive results, the retrieved stream of faces is not temporally coherent.
Thies et al. [29] show the first online reenactment system; however, they rely on depth data and use a generic teeth proxy for the mouth region.
In this paper, we address both shortcomings: 1) our method is the first real-time RGB-only reenactment technique; 2) we synthesize the mouth regions exclusively from the target sequence (no need for a teeth proxy or direct source-to-target copy).


3. Synthesis of Facial Imagery

We use a multi-linear PCA model based on [?].
The first two dimensions represent facial identity, i.e., geometric shape and skin reflectance, and the third dimension controls the facial expression.
Hence, we parametrize a face as:

$$ M_{geo}(\alpha, \delta) = a_{id} + E_{id} \cdot \alpha + E_{exp} \cdot \delta , \qquad (1) $$

$$ M_{alb}(\beta) = a_{alb} + E_{alb} \cdot \beta . \qquad (2) $$

This prior assumes a multivariate normal probability distribution of shape and reflectance around the average shape $a_{id} \in \mathbb{R}^{3n}$ and reflectance $a_{alb} \in \mathbb{R}^{3n}$.
The shape $E_{id} \in \mathbb{R}^{3n \times 80}$, reflectance $E_{alb} \in \mathbb{R}^{3n \times 80}$, and expression $E_{exp} \in \mathbb{R}^{3n \times 76}$ basis and the corresponding standard deviations $\sigma_{id} \in \mathbb{R}^{80}$, $\sigma_{alb} \in \mathbb{R}^{80}$, and $\sigma_{exp} \in \mathbb{R}^{76}$ are given.
The model has 53K vertices and 106K faces.
A synthesized image $C_S$ is generated through rasterization of the model under a rigid model transformation $\Phi(v)$ and the full perspective transformation $\Pi(v)$.
Illumination is approximated by the first three bands of Spherical Harmonics (SH) [23] basis functions, assuming Lambertian surfaces and smooth distant illumination, neglecting self-shadowing.

Synthesis is dependent on the face model parameters $\alpha$, $\beta$, $\delta$, the illumination parameters $\gamma$, the rigid transformation $R$, $t$, and the camera parameters $\kappa$ defining $\Pi$.
The vector of unknowns $\mathcal{P}$ is the union of these parameters.

4. Energy Formulation

Given a monocular input sequence, we reconstruct all unknown parameters $\mathcal{P}$ jointly with a robust variational optimization.
The proposed objective is highly non-linear in the unknowns and has the following components:

$$ E(\mathcal{P}) = \underbrace{w_{col} E_{col}(\mathcal{P}) + w_{lan} E_{lan}(\mathcal{P})}_{data} + \underbrace{w_{reg} E_{reg}(\mathcal{P})}_{prior} . \qquad (3) $$

The data term measures the similarity between the synthesized imagery and the input data in terms of photo-consistency $E_{col}$ and facial feature alignment $E_{lan}$.
The likelihood of a given parameter vector $\mathcal{P}$ is taken into account by the statistical regularizer $E_{reg}$.
The weights $w_{col}$, $w_{lan}$, and $w_{reg}$ balance the three different sub-objectives.
In all of our experiments, we set $w_{col} = 1$, $w_{lan} = 10$, and $w_{reg} = 2.5 \cdot 10^{-5}$.
In the following, we introduce the different sub-objectives.

Photo-Consistency. In order to quantify how well the input data is explained by a synthesized image, we measure the photometric alignment error on pixel level:

$$ E_{col}(\mathcal{P}) = \frac{1}{|\mathcal{V}|} \sum_{p \in \mathcal{V}} \| C_S(p) - C_I(p) \|_2 , \qquad (4) $$

where $C_S$ is the synthesized image, $C_I$ is the input RGB image, and $p \in \mathcal{V}$ denotes all visible pixel positions in $C_S$.
We use the $\ell_{2,1}$-norm [?] instead of a least-squares formulation to be robust against outliers.
In our scenario, distance in color space is based on $\ell_2$, while in the summation over all pixels an $\ell_1$-norm is used to enforce sparsity.
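
The effect of the robust norm can be illustrated with a minimal sketch. The toy "images" below are hypothetical four-pixel dictionaries, and the function names are illustrative rather than taken from the paper; the sketch only contrasts the unsquared per-pixel colour distance of Eq. (4) with a squared-error baseline when one pixel is an outlier.

```python
import math

def photo_consistency(synth, observed, visible):
    """l2,1 error: mean over visible pixels of the (unsquared) l2 colour distance."""
    return sum(math.dist(synth[p], observed[p]) for p in visible) / len(visible)

def least_squares_error(synth, observed, visible):
    """Squared-error baseline, shown only for comparison."""
    return sum(math.dist(synth[p], observed[p]) ** 2 for p in visible) / len(visible)

# Hypothetical data: pixel 1 is slightly off, pixel 3 is a gross outlier
# (for example an occluder that the face model cannot explain).
synth = {p: (0.5, 0.4, 0.3) for p in range(4)}
observed = {
    0: (0.5, 0.4, 0.3),
    1: (0.6, 0.4, 0.3),
    2: (0.5, 0.4, 0.3),
    3: (0.5, 0.4, 4.3),
}
visible = [0, 1, 2, 3]

e_col = photo_consistency(synth, observed, visible)   # about 1.025
e_ls = least_squares_error(synth, observed, visible)  # about 4.0025
print(e_col, e_ls)
```

In this toy case the outlier enters the robust error linearly (4.0 of 4.1 in the sum) but the squared error quadratically (16.0 of 16.01), which is why the least-squares variant is pulled much harder towards explaining it.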

Feature Alignment. In addition, we enforce feature similarity between a set of salient facial feature point pairs detected in the RGB stream:

$$ E_{lan}(\mathcal{P}) = \frac{1}{|\mathcal{F}|} \sum_{f_j \in \mathcal{F}} w_{conf,j} \, \| f_j - \Pi(\Phi(v_j)) \|_2^2 . \qquad (5) $$

To this end, we employ a state-of-the-art facial landmark tracking algorithm by [24].
Each feature point $f_j \in \mathcal{F} \subset \mathbb{R}^2$ comes with a detection confidence $w_{conf,j}$ and corresponds to a unique vertex $v_j = M_{geo}(\alpha, \delta) \in \mathbb{R}^3$ of our face prior.
This helps to avoid local minima in the highly complex energy landscape of $E_{col}(\mathcal{P})$.

Statistical Regularization. We enforce plausibility of the synthesized faces based on the assumption of a normally distributed population.
To this end, we enforce the parameters to stay statistically close to the mean:

$$ E_{reg}(\mathcal{P}) = \sum_{i=1}^{80} \left[ \left( \frac{\alpha_i}{\sigma_{id,i}} \right)^2 + \left( \frac{\beta_i}{\sigma_{alb,i}} \right)^2 \right] + \sum_{i=1}^{76} \left( \frac{\delta_i}{\sigma_{exp,i}} \right)^2 . \qquad (6) $$

This commonly used regularization strategy prevents degenerations of the facial geometry and reflectance, and guides the optimization strategy out of local minima [?].
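
How the three sub-objectives combine can be sketched as follows. The weights are the ones stated in the text; the regularizer is written in the usual form for such statistical priors (each coefficient divided by its standard deviation, then squared), which is an assumption here, and all function names and sample values are hypothetical.

```python
# Weights as stated in the text: w_col = 1, w_lan = 10, w_reg = 2.5e-5.
W_COL, W_LAN, W_REG = 1.0, 10.0, 2.5e-5

def statistical_regularizer(params, sigmas):
    """Sum of squared coefficients, each scaled by its standard deviation (assumed form)."""
    return sum((p / s) ** 2 for p, s in zip(params, sigmas))

def total_energy(e_col, e_lan, e_reg):
    """Weighted sum of photo-consistency, landmark alignment, and prior terms."""
    return W_COL * e_col + W_LAN * e_lan + W_REG * e_reg

# Hypothetical values: three coefficients, each exactly one standard deviation from the mean.
e_reg = statistical_regularizer([1.0, -2.0, 0.5], [1.0, 2.0, 0.5])  # 3.0
energy = total_energy(e_col=0.2, e_lan=0.01, e_reg=e_reg)           # about 0.300075
print(e_reg, energy)
```

The small value of `W_REG` means the prior only starts to matter once coefficients drift many standard deviations away from the mean, while the landmark term is weighted ten times more strongly than the colour term.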


5. Data-parallel Optimization Strategy

The proposed robust tracking objective is a general unconstrained non-linear optimization problem.
We minimize this objective in real time using a novel data-parallel GPU-based Iteratively Reweighted Least Squares (IRLS) solver.
The key idea of IRLS is to transform the problem, in each iteration, to a non-linear least-squares problem by splitting the norm in two components:

$$ \| r(\mathcal{P}) \|_2 = \underbrace{\left( \| r(\mathcal{P}_{old}) \|_2 \right)^{-1}}_{constant} \cdot \| r(\mathcal{P}) \|_2^2 . $$

Here, $r(\cdot)$ is a general residual and $\mathcal{P}_{old}$ is the solution computed in the last iteration.
Thus, the first part is kept constant during one iteration and updated afterwards.
Close in spirit to [29], each single iteration step is implemented using the Gauss-Newton approach.
We take a single GN step in every IRLS iteration and solve the corresponding system of normal equations $J^T J \delta^* = -J^T F$ based on PCG to obtain an optimal linear parameter update $\delta^*$.
The Jacobian $J$ and the system's right-hand side $-J^T F$ are precomputed and stored in device memory for later processing as proposed by Thies et al. [29].
As suggested by [33, 29], we split up the multiplication of the old descent direction $d$ with the system matrix $J^T J$ in the PCG solver into two successive matrix-vector products.
Additional details regarding the optimization framework are provided in the supplemental material.
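
The reweighting idea can be shown on a deliberately tiny problem. The sketch below is not the GPU solver described above: it minimizes a sum of absolute residuals for a single scalar unknown, so the reweighted least-squares problem has a closed-form solution (a weighted mean) and no Gauss-Newton or PCG step is needed. The function name, the `eps` guard against division by zero, and the sample data are assumptions made for illustration.

```python
def irls_l1_location(samples, iterations=50, eps=1e-9):
    """Minimise sum_i |x - a_i| by iteratively reweighted least squares."""
    x = sum(samples) / len(samples)  # start from the least-squares solution (the mean)
    for _ in range(iterations):
        # Inverse residual norms at the previous solution; constant within this iteration.
        weights = [1.0 / max(abs(x - a), eps) for a in samples]
        # The reweighted objective sum_i w_i (x - a_i)^2 is minimised by the weighted mean.
        x = sum(w * a for w, a in zip(weights, samples)) / sum(weights)
    return x

samples = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is an outlier
mean = sum(samples) / len(samples)      # 22.0, dominated by the outlier
robust = irls_l1_location(samples)      # converges to the median, 3.0
print(mean, robust)
```

The split of the norm is visible in the `weights` line: the factor that depends on the previous solution is frozen, and what remains is an ordinary least-squares problem in the current unknown.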


6. Non-Rigid Model-Based Bundling

To estimate the identity of the actors in the heavily under-constrained scenario of monocular reconstruction, we introduce a non-rigid model-based bundling approach.
Based on the proposed objective, we jointly estimate all parameters over $k$ keyframes of the input video sequence.
The estimated unknowns are the global identity $\{\alpha, \beta\}$ and intrinsics $\kappa$ as well as the unknown per-frame pose $\{\delta^k, R^k, t^k\}_k$ and illumination parameters $\{\gamma^k\}_k$.
We use a similar data-parallel optimization strategy as proposed for model-to-frame tracking, but jointly solve the normal equations for the entire keyframe set.
For our non-rigid model-based bundling problem, the non-zero structure of the corresponding Jacobian is block dense.
Our PCG solver exploits the non-zero structure for increased performance (see additional document).
Since all keyframes observe the same face identity under potentially varying illumination, expression, and viewing angle, we can robustly separate identity from all other problem dimensions.
Note that we also solve for the intrinsic camera parameters of $\Pi$, thus being able to process uncalibrated video footage.

7. Expression Transfer

To transfer the expression changes from the source to the target actor while preserving person-specificness in each actor's expressions, we propose a sub-space deformation transfer technique.
We are inspired by the deformation transfer energy of Sumner et al. [27], but operate directly in the space spanned by the expression blendshapes.
This not only allows for the precomputation of the pseudo-inverse of the system matrix, but also drastically reduces the dimensionality of the optimization problem, allowing for fast real-time transfer rates.
Assuming source identity $\alpha^S$ and target identity $\alpha^T$ fixed, transfer takes as input the neutral $\delta^S_N$, deformed source $\delta^S$, and the neutral target $\delta^T_N$ expression.
Output is the transferred facial expression $\delta^T$ directly in the reduced sub-space of the parametric prior.

As proposed by [27], we first compute the source deformation gradients $A_i \in \mathbb{R}^{3 \times 3}$ that transform the source triangles from neutral to deformed.
The deformed target $\hat{v}_i = M_i(\alpha^T, \delta^T)$ is then found based on the undeformed state $v_i = M_i(\alpha^T, \delta^T_N)$ by solving a linear least-squares problem.
Let $(i_0, i_1, i_2)$ be the vertex indices of the $i$-th triangle, $V = [v_{i_1} - v_{i_0}, v_{i_2} - v_{i_0}]$ and $\hat{V} = [\hat{v}_{i_1} - \hat{v}_{i_0}, \hat{v}_{i_2} - \hat{v}_{i_0}]$, then the optimal unknown target deformation $\delta^T$ is the minimizer of:

$$ E(\delta^T) = \sum_{i=1}^{|F|} \| A_i V - \hat{V} \|_F^2 . \qquad (7) $$

This problem can be rewritten in the canonical least-squares form by substitution:

$$ E(\delta^T) = \| A \delta^T - b \|_2^2 . \qquad (8) $$

The matrix $A \in \mathbb{R}^{6|F| \times 76}$ is constant and contains the edge information of the template mesh projected to the expression sub-space.
Edge information of the target in neutral expression is included in the right-hand side $b \in \mathbb{R}^{6|F|}$.
$b$ varies with $\delta^S$ and is computed on the GPU for each new input frame.
The minimizer of the quadratic energy can be computed by solving the corresponding normal equations.
Since the system matrix is constant, we can precompute its pseudo-inverse using a Singular Value Decomposition (SVD).
Later, the small 76 x 76 linear system is solved in real time.
No additional smoothness term as in [?] is needed, since the blendshape model implicitly restricts the result to plausible shapes and guarantees smoothness.

Figure 2: Mouth Retrieval: we use an appearance graph to retrieve new mouth frames.
In order to select a frame, we enforce similarity to the previously retrieved frame while minimizing the distance to the target expression.
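
The benefit of a constant system matrix can be sketched on a toy problem with two unknowns instead of 76. Note the substitution: for brevity the sketch forms the pseudo-inverse through an explicit 2 x 2 inverse of the normal equations rather than through the SVD mentioned in the text, and the matrix, right-hand side, and helper names are all hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Constant system matrix (stands in for the edge information projected to the
# expression sub-space). It never changes, so its pseudo-inverse is computed once.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
A_T = transpose(A)
PSEUDO_INVERSE = matmul(inverse_2x2(matmul(A_T, A)), A_T)  # (A^T A)^-1 A^T

def transfer(b):
    """Per-frame work: one small matrix-vector product with the precomputed pseudo-inverse."""
    return matvec(PSEUDO_INVERSE, b)

# Hypothetical right-hand side generated from known coefficients, so the
# least-squares solution must reproduce them.
delta_true = [0.3, -0.2]
b = matvec(A, delta_true)
delta = transfer(b)
print(delta)  # approximately [0.3, -0.2]
```

Only `b` has to be rebuilt for each new source frame; everything that depends on `A` is paid for once up front, which is what makes the per-frame cost small.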

8. Mouth Retrieval 

For a given transferred facial expression, we need to synthesize a realistic target mouth region.
To this end, we retrieve and warp the best matching mouth image from the target actor sequence.
We assume that sufficient mouth variation is available in the target video.
It is also important to note that we maintain the appearance of the target mouth.
This leads to much more realistic results than either copying the source mouth region [30, 11] or using a generic 3D teeth proxy [14, 29].

Our approach first finds the best fitting target mouth frame based on a frame-to-cluster matching strategy with a novel feature similarity metric.
To enforce temporal coherence, we use a dense appearance graph to find a compromise between the last retrieved mouth frame and the target mouth frame (cf. Fig. 2).
We detail all steps in the following.


Similarity Metric. Our similarity metric is based on geometric and photometric features.
The used descriptor $\mathcal{K} = \{R, \delta, \mathcal{F}, \mathcal{L}\}$ of a frame is composed of the rotation $R$, expression parameters $\delta$, landmarks $\mathcal{F}$, and a Local Binary Pattern (LBP) $\mathcal{L}$.
We compute these descriptors $\mathcal{K}^S$ for every frame in the training sequence.
The target descriptor $\mathcal{K}^T$ consists of the result of the expression transfer and the LBP of the frame of the driving actor.
We measure the distance between a source and a target descriptor as follows:

$$ D(\mathcal{K}^T, \mathcal{K}^S_t, t) = D_p(\mathcal{K}^T, \mathcal{K}^S_t) + D_m(\mathcal{K}^T, \mathcal{K}^S_t) + D_a(\mathcal{K}^T, \mathcal{K}^S_t, t) . $$

The first term $D_p$ measures the distance in parameter space:

$$ D_p(\mathcal{K}^T, \mathcal{K}^S_t) = \| \delta^T - \delta^S_t \|_2^2 + \| R^T - R^S_t \|_F^2 . $$

The second term $D_m$ measures the differential compatibility of the sparse facial landmarks:

$$ D_m(\mathcal{K}^T, \mathcal{K}^S_t) = \sum_{(i,j) \in \Omega} \left( \| \mathcal{F}^T_i - \mathcal{F}^T_j \|_2 - \| \mathcal{F}^S_{t,i} - \mathcal{F}^S_{t,j} \|_2 \right)^2 . $$

Here, $\Omega$ is a set of predefined landmark pairs, defining distances such as between the upper and lower lip or between the left and right corner of the mouth.
The last term $D_a$ is an appearance measurement term composed of two parts:

$$ D_a(\mathcal{K}^T, \mathcal{K}^S_t, t) = D_l(\mathcal{K}^T, \mathcal{K}^S_t) + w_c(\mathcal{K}^T, \mathcal{K}^S_t) \, D_c(\tau, t) . $$

$\tau$ is the last retrieved frame index used for the reenactment in the previous frame.
$D_l(\mathcal{K}^T, \mathcal{K}^S_t)$ measures the similarity based on LBPs that are compared via a Chi Squared Distance (for details see [?]).
$D_c(\tau, t)$ measures the similarity between the last retrieved frame $\tau$ and the video frame $t$ based on RGB cross-correlation of the normalized mouth frames.
Note that the mouth frames are normalized based on the model's texture parameterization (cf. Fig. 2).
To facilitate fast frame jumps for expression changes, we incorporate the weight $w_c(\mathcal{K}^T, \mathcal{K}^S_t) = e^{-(D_m(\mathcal{K}^T, \mathcal{K}^S_t))^2}$.
We apply this frame-to-frame distance measure in a frame-to-cluster matching strategy, which enables real-time rates and mitigates high-frequency jumps between mouth frames.
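
The landmark term and the jump weight can be illustrated with a small sketch. The four 2D landmarks, the two landmark pairs, and the function names are hypothetical; the sketch assumes the reading of the text in which each pair contributes the squared difference of its inter-landmark distances and the weight is the exponential of the negative squared landmark term.

```python
import math

def landmark_compatibility(target_lm, candidate_lm, pairs):
    """D_m: squared differences of pairwise landmark distances, summed over the pairs."""
    return sum(
        (math.dist(target_lm[i], target_lm[j]) - math.dist(candidate_lm[i], candidate_lm[j])) ** 2
        for i, j in pairs
    )

def coherence_weight(d_m):
    """w_c: close to 1 for similar mouth shapes, close to 0 for very different ones."""
    return math.exp(-d_m ** 2)

# Hypothetical landmarks: 0 = upper lip, 1 = lower lip, 2 = left corner, 3 = right corner.
pairs = [(0, 1), (2, 3)]
target = [(0.0, 1.0), (0.0, -1.0), (-2.0, 0.0), (2.0, 0.0)]      # mouth opened by 2 units
candidate = [(0.0, 0.5), (0.0, -0.5), (-2.0, 0.0), (2.0, 0.0)]   # mouth opened by 1 unit

d_m = landmark_compatibility(target, candidate, pairs)  # (2 - 1)^2 + (4 - 4)^2 = 1.0
w_c = coherence_weight(d_m)                             # exp(-1), about 0.368
print(d_m, w_c)
```

Because the weight multiplies the coherence term that ties a candidate to the previously retrieved frame, a large landmark mismatch switches that tie off, which is how the metric permits a quick jump to a different mouth shape.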


Frame-to-Cluster Matching. Utilizing the proposed similarity metric, we cluster the target actor sequence into k = 10 clusters using a modified k-means algorithm that is based on the pairwise distance function $D$.
For every cluster, we select the frame with the minimal distance to all other frames within that cluster as a representative.
During runtime, we measure the distances between the target descriptor $\mathcal{K}^T$ and the descriptors of cluster representatives, and choose the cluster whose representative frame has the minimal distance as the new target frame.
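
The representative selection and the runtime lookup can be sketched as follows. The clustering itself is taken as given, scalar values stand in for the full frame descriptors, and "minimal distance to all other frames" is interpreted as the minimal summed distance; these choices and all names are assumptions for illustration.

```python
def representative(cluster, dist):
    """Frame with the minimal summed distance to all frames of its cluster (the medoid)."""
    return min(cluster, key=lambda f: sum(dist(f, g) for g in cluster))

def nearest_cluster(query, representatives, dist):
    """Index of the cluster whose representative is closest to the query descriptor."""
    return min(range(len(representatives)), key=lambda c: dist(query, representatives[c]))

def toy_distance(a, b):
    """Stand-in for the pairwise distance D; here just an absolute difference."""
    return abs(a - b)

# Hypothetical precomputed clusters of one-dimensional descriptors.
clusters = [[0.0, 0.1, 0.3], [1.0, 1.2, 1.1]]
representatives = [representative(c, toy_distance) for c in clusters]  # [0.1, 1.1]

# At runtime only the representatives are compared against the target descriptor.
chosen = nearest_cluster(0.9, representatives, toy_distance)           # cluster 1
print(representatives, chosen)
```

The runtime cost therefore scales with the number of clusters rather than with the number of frames in the target sequence.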


Table 1: Avg. run times for the three sequences of Fig. 8, from top to bottom.
Standard deviations w.r.t. the final frame rate are 0.51, 0.56, and 0.59 fps, respectively.
Note that CPU and GPU stages run in parallel.

          CPU          |             GPU               |
SparseFT    MouthRT    | DenseFT    DeformTF   Synth   | FPS
5.97ms      1.90ms     | 22.06ms    3.98ms     10.19ms | 27.6Hz
4.85ms      1.50ms     | 21.27ms    4.01ms     10.31ms | 28.1Hz
5.57ms      1.78ms     | 20.97ms    3.95ms     10.32ms | 28.4Hz

Appearance Graph. We improve temporal coherence by building a fully connected appearance graph of all video frames.
The edge weights are based on the RGB cross-correlation between the normalized mouth frames, the distance in parameter space $D_p$, and the distance of the landmarks $D_m$.
The graph enables us to find an in-between frame that is similar to both the last retrieved frame and the retrieved target frame (see Fig. 2).
We compute this perfect match by finding the frame of the training sequence that minimizes the sum of the edge weights to the last retrieved and current target frame.
We blend between the previously retrieved frame and the newly retrieved frame in texture space on a pixel level after optic flow alignment.
Before blending, we apply an illumination correction that considers the estimated Spherical Harmonic illumination parameters of the retrieved frames and the current video frame.
Finally, we composite the new output frame by alpha blending between the original video frame, the illumination-corrected, projected mouth frame, and the rendered face model.
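
The selection of the in-between frame on the appearance graph can be sketched as a single minimization over all frames. The edge weights below are hypothetical: each frame is reduced to one scalar "mouth opening" value and the weight is the squared difference, chosen only so that intermediate frames are cheaper than a direct jump; the actual weights combine cross-correlation, $D_p$, and $D_m$ as described above.

```python
def in_between_frame(weights, last, target):
    """Frame minimising the summed edge weight to the last retrieved and the target frame."""
    return min(range(len(weights)), key=lambda f: weights[f][last] + weights[f][target])

# Hypothetical per-frame mouth opening values for a five-frame training sequence.
opening = [0.0, 0.2, 0.5, 0.8, 1.0]

# Fully connected graph stored as a dense matrix of toy edge weights.
weights = [[(a - b) ** 2 for b in opening] for a in opening]

# Last retrieved frame is fully closed (0), newly matched target frame is fully open (4).
chosen = in_between_frame(weights, last=0, target=4)
print(chosen)  # 2, the half-open frame
```

Since the graph is built once for the target sequence, the per-frame query is a linear scan over precomputed weights.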

9. Results 

Live Reenactment Setup. Our live reenactment setup consists of standard consumer-level hardware.
We capture a live video with a commodity webcam (source), and download monocular video clips from YouTube (target).
In our experiments, we use a Logitech HD Pro C920 camera running at 30Hz at a resolution of 640 x 480, although our approach is applicable to any consumer RGB camera.
Overall, we show highly realistic reenactment examples of our algorithm on a variety of target YouTube videos at a resolution of 1280 x 720.
The videos show different subjects in different scenes filmed from varying camera angles; each video is reenacted by several volunteers as source actors.
Reenactment results are generated at a resolution of 1280 x 720.
We show real-time reenactment results in Fig. 8 and in the accompanying video.

Runtime. For all experiments, we use three hierarchy levels for tracking (source and target).
In pose optimization, we only consider the second and third level, where we run one and seven Gauss-Newton steps, respectively.
Within a Gauss-Newton step, we always run four PCG steps.
In addition to tracking, our reenactment pipeline has additional stages whose timings are listed in Table 1.
Our method runs in real time on a commodity desktop computer with an NVIDIA Titan X and an Intel Core i7-4770.
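
The timings in Table 1 can be cross-checked with a little arithmetic. The sketch assumes that DenseFT, DeformTF, and Synth are the GPU stages, that they run one after another, and that they bound the frame rate because the CPU stages run in parallel and are shorter; this grouping is an interpretation of the table, not a statement from the text.

```python
# Per-sequence GPU stage timings in milliseconds (DenseFT, DeformTF, Synth), from Table 1.
GPU_STAGES_MS = [
    (22.06, 3.98, 10.19),
    (21.27, 4.01, 10.31),
    (20.97, 3.95, 10.32),
]
REPORTED_HZ = [27.6, 28.1, 28.4]  # FPS column of Table 1

def frame_rate_hz(stage_ms):
    """Frame rate if the listed stages run sequentially and bound the pipeline."""
    return 1000.0 / sum(stage_ms)

rates = [frame_rate_hz(stages) for stages in GPU_STAGES_MS]
for rate, reported in zip(rates, REPORTED_HZ):
    print(f"computed {rate:.2f} Hz, reported {reported} Hz")
```

Under this assumption the summed GPU times (36.23 ms, 35.59 ms, and 35.24 ms) reproduce the reported frame rates to one decimal place.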

Tracking Comparison to Previous Work. Face tracking alone is not the main focus of our work, but the following comparisons show that our tracking is on par with or exceeds the state of the art.

Shi et al. 2014 [26]: They capture face performances offline from monocular unconstrained RGB video.
The close-ups in Fig. 4 show that our online approach yields a closer face fit, particularly visible at the silhouette of the input face.
We believe that our new dense non-rigid bundle adjustment leads to a better shape identity estimate than their sparse approach.

Cao et al. 2014 [?]: They capture face performance from monocular RGB in real time.
In most cases, our and their method produce similar high-quality results (see Fig. 3); our identity and expression estimates are slightly more accurate though.

Thies et al. 2015 [29]: Their approach captures face performance in real time from RGB-D (see Fig. 3).
Results of both approaches are similarly accurate, but our approach does not require depth data.

Figure 3: Comparison of our RGB tracking to Cao et al. [?], and to RGB-D tracking by Thies et al. [29].

Figure 4: Comparison of our tracking to Shi et al. [26].
From left to right: RGB input, reconstructed model, overlay with input, close-ups on eye and cheek.
Note that Shi et al. perform shape-from-shading in a post process.


















FaceShift 2014: We compare our tracker to the commercial real-time RGB-D tracker from FaceShift, which is based on the work of Weise et al. [?].
Fig. 5 shows that we obtain similar results from RGB only.

Figure 5: Comparison against FaceShift RGB-D tracking.

Figure 6: Dubbing: Comparison to Garrido et al. [14]. Panels, from left to right: Input, Garrido et al. 2015, Ours.

Reenactment Evaluation. In Fig. 6, we compare our approach against state-of-the-art reenactment by Garrido et al. [14].
Both methods provide highly realistic reenactment results; however, their method is fundamentally offline, as they require all frames of a sequence to be present at any time.
In addition, they rely on a generic geometric teeth proxy, which in some frames makes reenactment less convincing.
In Fig. 7, we compare against the work by Thies et al. [29].
Runtime and visual quality are similar for both approaches; however, their geometric teeth proxy leads to undesired appearance changes in the reenacted mouth.
Moreover, Thies et al. use an RGB-D camera, which limits the application range; they cannot reenact YouTube videos.
We show additional comparisons in the supplemental material against Dale et al. [11] and Garrido et al. [13].

10. Limitations 

The assumption of Lambertian surfaces and smooth illumination is limiting, and may lead to artifacts in the presence of hard shadows or specular highlights; a limitation shared by most state-of-the-art methods.
Scenes with face occlusions by long hair and a beard are challenging.
Furthermore, we only reconstruct and track a low-dimensional blendshape model (76 expression coefficients), which omits fine-scale static and transient surface details.
Our retrieval-based mouth synthesis assumes sufficient visible expression variation in the target sequence.
On a too short sequence, or when the target remains static, we cannot learn the person-specific mouth behavior.
In this case, temporal aliasing can be observed, as the target space of the retrieved mouth samples is too sparse.
Another limitation is caused by our hardware setup (webcam, USB, and PCI), which introduces a small delay of approximately 3 frames.
Specialized hardware could resolve this, but our aim is a setup with commodity hardware.

Figure 7: Comparison of the proposed RGB reenactment to the RGB-D reenactment of Thies et al. [29].

11. Conclusion 

The presented approach is the first real-time facial reenactment system that requires just monocular RGB input.
Our live setup enables the animation of legacy video footage, e.g., from YouTube, in real time.
Overall, we believe our system will pave the way for many new and exciting applications in the fields of VR/AR, teleconferencing, or on-the-fly dubbing of videos with translated audio.

Figure 8: Results of our reenactment system. Corresponding run times are listed in Table 1.
The length of the source and resulting output sequences is 965, 1436, and 1791 frames, respectively; the length of the input target sequences is 431, 286, and 392 frames, respectively.